Main goals:
(I will be walking through the code for illustrative purposes, but I can't teach you how to program in 20 minutes!)
# Import basic functions
%matplotlib inline
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
from copy import deepcopy
pd.set_option('display.max_columns', 50)
This is a useful, publicly available dataset for demonstrating some common data science techniques (data source). We'll build some toy examples here, but the methods and principles generalize easily to other datasets.
# Load in raw profiles
dating_data = pd.read_csv("./dating_data/profiles_sample.csv", index_col=0)
dating_data.head()
dating_data.shape
In business contexts: similar methods can use somebody's profile on your website to predict whether they would be interested in your product.
# Let's use just these features to try to predict a person's age
# (I'm excluding variables like "kids", which might be dead giveaways.)
prof_cols = ['body_type', 'diet', 'drinks', 'drugs', 'education', 'location', 'job', 'orientation', 'sex', 'smokes', 'speaks']
dating_data[prof_cols].head()
Question: How do we get a computer to "understand" a person's dating profile?
Answer: Math! (matrices, linear algebra).
# Most columns are "categorical"
# e.g., for whether or not someone drinks alcohol, they
# can choose from among the following categories:
dating_data.drinks.unique()
# To convert this data into a matrix, we will take each
# category and convert it into a binary column:
dating_data.drinks.str.get_dummies().head(n=20)
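A toy illustration of what `str.get_dummies` produces, using a hypothetical stand-in for the "drinks" column (made-up values, not the real dataset):

```python
import pandas as pd

# Hypothetical toy Series standing in for the "drinks" column
drinks = pd.Series(["socially", "often", "not at all", "socially"])

# One binary column per category, columns sorted alphabetically
dummies = drinks.str.get_dummies()
print(list(dummies.columns))   # ['not at all', 'often', 'socially']
print(dummies.loc[0].tolist()) # [0, 0, 1] -- row 0 was "socially"
```

Each row still sums to 1, since every person falls in exactly one category; this is what turns a categorical column into something a model can multiply by coefficients.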
# Note: data is often very messy
# Lots of work in data science is just cleaning/processing data
# Example:
dating_data.pets.unique()
# I've done the processing work ahead of time for
# the rest of the columns in the dataset
# Load in pre-processed data:
profile_features = pd.read_csv("./dating_data/profile_features.csv", index_col=0)
profile_features.head(n=10)
# How to define outcome variable (age)?
age = dating_data.age
age.head()
_ = plt.hist(age)
_ = plt.title("Distribution of ages in dataset")
# In most applications, you probably don't need super
# fine precision, i.e., someone's exact age
# Here, we will "discretize" age into a categorical variable:
# Binary definition; i.e., "is 30 yrs old or younger"
age_30 = (age <= 30)
age_30.head()
# Categorical definition:
# Define bin boundaries
bins = [0,20,30,40,50,100]
# Use pd.cut function to bin the data
category = pd.cut(age, bins)
age_bins = category.apply(lambda x: str(x))
age_bins.head()
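A small self-contained sketch of how `pd.cut` assigns values to bins (toy ages, not the real data):

```python
import pandas as pd

ages = pd.Series([18, 25, 30, 42, 67])
bins = [0, 20, 30, 40, 50, 100]

# Intervals are right-closed by default: 30 falls in (20, 30], not (30, 40]
binned = pd.cut(ages, bins).astype(str)
print(binned.tolist())
# ['(0, 20]', '(20, 30]', '(20, 30]', '(40, 50]', '(50, 100]']
```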
# Building a basic logistic regression classifier
# using profile features to predict age
from sklearn.linear_model import LogisticRegression
age_logit = LogisticRegression()
age_logit.fit(profile_features, age_30)
logit_predictions = pd.DataFrame({
    "prediction": age_logit.predict(profile_features),
    "ground_truth": age_30
})
logit_predictions['correct'] = (logit_predictions.prediction == logit_predictions.ground_truth)
logit_predictions.head(n=10)
# We usually think of "True" as 1 and "False" as 0
logit_predictions.astype(int).head()
# Evaluate overall accuracy:
logit_accuracy = logit_predictions.correct.mean()
print("Logistic regression accuracy: {:.2f}%".format(logit_accuracy*100))
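Accuracy here is nothing more than the mean of a boolean "was this prediction correct?" vector; a tiny hand-computed example (made-up values):

```python
predictions  = [True, False, True, True]
ground_truth = [True, True,  True, False]

# Element-wise correctness, then its mean = fraction correct
correct = [p == g for p, g in zip(predictions, ground_truth)]
accuracy = sum(correct) / len(correct)
print("Accuracy: {:.2f}%".format(accuracy * 100))  # Accuracy: 50.00%
```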
We'll try making the same prediction, using different machine learning models:
# Logistic regression
from sklearn.linear_model import LogisticRegression
age_logit = LogisticRegression()
age_logit.fit(profile_features, age_30)
round((age_logit.predict(profile_features)==age_30).mean()*100, 2)
# Decision Tree
from sklearn.tree import DecisionTreeClassifier
age_dt = DecisionTreeClassifier(max_depth=15, min_samples_leaf=5)
age_dt.fit(profile_features, age_30)
round((age_dt.predict(profile_features)==age_30).mean()*100, 2)
# Random forest
from sklearn.ensemble import RandomForestClassifier
age_rf = RandomForestClassifier(n_estimators=100, max_depth=20, min_samples_leaf=5)
age_rf.fit(profile_features, age_30)
round((age_rf.predict(profile_features)==age_30).mean()*100, 2)
If you know what cross-validation is, this is just a short demonstration on how to compare the various models using out-of-sample, cross-validated accuracy measures.
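For readers new to the idea, here is a minimal pure-Python sketch of the k-fold splitting that `cross_validate` performs internally (illustration only; sklearn additionally handles shuffling, stratification, model refitting, and scoring):

```python
# Split n_samples indices into k folds; each fold takes one turn
# as the held-out test set while the rest form the training set.
def kfold_indices(n_samples, k):
    indices = list(range(n_samples))
    fold_size, remainder = divmod(n_samples, k)
    folds, start = [], 0
    for i in range(k):
        end = start + fold_size + (1 if i < remainder else 0)
        folds.append(indices[start:end])
        start = end
    for i in range(k):
        train = [idx for j, fold in enumerate(folds) if j != i for idx in fold]
        yield train, folds[i]

splits = list(kfold_indices(10, 5))
print(len(splits))        # 5 train/test splits
print(splits[0][1])       # first held-out fold: [0, 1]
print(len(splits[0][0]))  # 8 training samples per split
```

Averaging the test-set score over all five splits is what the `.mean()` calls below report.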
from sklearn.model_selection import cross_validate
scoring = {
    "accuracy": "accuracy",
    "precision": "precision",
    "recall": "recall",
    "f1": "f1_macro"
}
logit_clf = LogisticRegression()
scoring_obj = cross_validate(logit_clf, profile_features, age_30, scoring=scoring, cv=5, return_train_score=False)
for sc in scoring.keys():
    print("{: >10}: {:.3f}".format(sc, scoring_obj["test_"+sc].mean()))
dt_clf = DecisionTreeClassifier(max_depth=15, min_samples_leaf=5)
scoring_obj = cross_validate(dt_clf, profile_features, age_30, scoring=scoring, cv=5, return_train_score=False)
for sc in scoring.keys():
    print("{: >10}: {:.3f}".format(sc, scoring_obj["test_"+sc].mean()))
rf_clf = RandomForestClassifier(n_estimators=100, max_depth=20, min_samples_leaf=5)
scoring_obj = cross_validate(rf_clf, profile_features, age_30, scoring=scoring, cv=5, return_train_score=False)
for sc in scoring.keys():
    print("{: >10}: {:.3f}".format(sc, scoring_obj["test_"+sc].mean()))
How can we improve performance? One idea: use text inputs from user profiles.
dating_data[[c for c in dating_data.columns if c.startswith("essay")]].head()
Working with text is messy, and training vector models can take a long time, so I've done essentially all of the hard work ahead of time. The result is loaded below:
text_features = pd.read_csv("./dating_data/text_features.csv", index_col=0)
text_features.head()
# Using embedding of text data to predict age:
age_logit = LogisticRegression()
age_logit.fit(text_features, age_30)
(age_logit.predict(text_features)==age_30).mean()
# What happens if we combine the profile characteristics and text features?
combined_features = np.hstack((text_features.values, profile_features.values))
age_logit = LogisticRegression()
age_logit.fit(combined_features, age_30)
(age_logit.predict(combined_features)==age_30).mean()
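What `np.hstack` does here is simply place the two feature matrices side by side, row-aligned. A toy-shape sketch (hypothetical numbers; only the shapes matter):

```python
import numpy as np

# Toy shapes: 2 users, 3 text features, 2 profile features
text = np.array([[0.1, 0.2, 0.3],
                 [0.4, 0.5, 0.6]])
profile = np.array([[1, 0],
                    [0, 1]])

# Rows stay aligned per user; columns are concatenated
combined = np.hstack((text, profile))
print(combined.shape)  # (2, 5)
```

Both inputs must have the same number of rows, i.e., the same users in the same order.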
# What about using fancy methods with fancy word embeddings?
age_rf = RandomForestClassifier(n_estimators=50, max_depth=40, min_samples_leaf=10)
age_rf.fit(text_features, age_30)
(age_rf.predict(text_features)==age_30).mean()
# BE WARY! This is "in-sample" fit; predictions on "out-of-sample"
# data are actually no better than logistic regression in this case
logit_clf = LogisticRegression()
scoring_obj = cross_validate(logit_clf, text_features, age_30, scoring=scoring, cv=5, return_train_score=False)
for sc in scoring.keys():
    print("{: >10}: {:.3f}".format(sc, scoring_obj["test_"+sc].mean()))
rf_clf = RandomForestClassifier(n_estimators=100, max_depth=40, min_samples_leaf=5)
scoring_obj = cross_validate(rf_clf, text_features, age_30, scoring=scoring, cv=5, return_train_score=False)
for sc in scoring.keys():
    print("{: >10}: {:.3f}".format(sc, scoring_obj["test_"+sc].mean()))
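An exaggerated, hypothetical sketch of why in-sample fit can be misleading: a "model" that memorizes its training labels scores perfectly in-sample, yet is only at chance level on unseen data.

```python
import random

random.seed(0)

# Training data the "model" memorizes (labels are random coin flips)
train_labels = {i: random.randint(0, 1) for i in range(100)}

def predict(x):
    if x in train_labels:
        return train_labels[x]    # perfect recall on training data
    return random.randint(0, 1)   # coin flip on anything unseen

in_sample = sum(predict(i) == train_labels[i]
                for i in train_labels) / len(train_labels)

test_labels = {i: random.randint(0, 1) for i in range(100, 1100)}
out_of_sample = sum(predict(i) == test_labels[i]
                    for i in test_labels) / len(test_labels)

print(in_sample)       # 1.0 -- looks like a perfect model
print(out_of_sample)   # close to 0.5 -- chance level
```

Real overfitting is less extreme, but the same mechanism is why the cross-validated scores above, not the in-sample accuracies, are the numbers to compare.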
import pkg_resources
import types
def get_imports():
    for name, val in globals().items():
        if isinstance(val, types.ModuleType):
            # Split ensures you get the root package,
            # not just an imported function
            name = val.__name__.split(".")[0]
        elif isinstance(val, type):
            name = val.__module__.split(".")[0]
        # Some packages are weird and have different
        # imported names vs. system/pip names. Unfortunately,
        # there is no systematic way to get pip names from
        # a package's imported name. You'll have to add
        # exceptions to this list manually!
        poorly_named_packages = {
            "PIL": "Pillow",
            "sklearn": "scikit-learn"
        }
        if name in poorly_named_packages.keys():
            name = poorly_named_packages[name]
        yield name
imports = list(set(get_imports()))
# The only way I found to get the version of the root package
# from only the name of the package is to cross-check the names
# of installed packages vs. imported packages
requirements = []
for m in pkg_resources.working_set:
    if m.project_name in imports and m.project_name != "pip":
        requirements.append((m.project_name, m.version))
for r in requirements:
    print("{}=={}".format(*r))